ls-apis needs to detect cycles in dependency unit graph#9707
ls-apis needs to detect cycles in dependency unit graph#9707davepacheco wants to merge 17 commits intomainfrom
Conversation
davepacheco
left a comment
There was a problem hiding this comment.
@sunshowers what do you think of this?
dev-tools/ls-apis/src/system_apis.rs
Outdated
| | ("mgd", "ddm-admin-client") | ||
| | ("mgd", "dpd-client") | ||
| | ("mgd", "gateway-client") | ||
| | ("omicron-sled-agent", "gateway-client") |
There was a problem hiding this comment.
Ug. Every sled-agent also talks to both MGS instances. I'm not sure it strictly needs to, but would need to talk to the networking folks more to understand if we can take it out. This happens here:
omicron/sled-agent/src/bootstrap/early_networking.rs
Lines 282 to 291 in 3b27679
The callers are:
- RSS, so it can pass a
HashMap<SwitchLocation, Ipv6Addr>to Nexus (seems like in principle Nexus could figure this out on its own?) - OPTE port creation, so it match up "which switch(es) has/have configured uplinks" with "what's the IP of the dendrite on that switch" (maybe we could always try to look up all dendrite instances regardless of configured uplinks, which means we wouldn't need to talk to MGS?)
934b040 to
5e04994
Compare
There was a problem hiding this comment.
I'm verifying the specific cases here one-by-one and have found a few mistakes so far. @sunshowers did you verify Claude's analysis here:
#9708 (comment)
I found these errors by just manually grep'ing each repo for the corresponding client. For example, looking at how sled agent uses mgs, I went into omicron, cd sled-agent, git grep gateway_client, and found a use that Claude missed. I also looked at the one it reported but I think it's incorrect -- this one is always on the local system (not localhost, but that's not the constraint we care about). Could you check these? (I stopped partway through.)
dev-tools/ls-apis/api-manifest.toml
Outdated
| [[localhost_only_edges]] | ||
| server = "ddmd" | ||
| client = "dpd-client" | ||
| note = "verified: sled-agent configures ddmd with dpd_host=[::1] (services.rs:3127)" |
There was a problem hiding this comment.
ddmd runs in both the global zone and the switch zone (see documentation at L261-L268 above). The referenced code at services.rs L3127 appears to be only the one in the switch zone. I'm not sure where the GZ one has dpd configured, or if it does. Just looking at sled 14 on rack2, the dendrite-related SMF properties are empty strings and the start method does not seem to have configured the corresponding CLI arguments:
BRM42220051 # svcprop -p config svc:/oxide/mg-ddm:default | grep dpd_
config/dpd_host astring ""
config/dpd_port astring ""
BRM42220051 # svcs -p mg-ddm
STATE STIME FMRI
online 1986 svc:/oxide/mg-ddm:default
1986 671 ctrun
BRM42220051 # ptree 671
671 ctrun -l child -o noorphan,regent /opt/oxide/mg-ddm/pkg/ddm_method_script.sh
672 /opt/oxide/mg-ddm/ddmd --with-stats --admin-port 8000 --admin-addr ::1 --ki
BRM42220051 # pargs 672
672: /opt/oxide/mg-ddm/ddmd --with-stats --admin-port 8000 --admin-addr ::1 --kind s
argv[0]: /opt/oxide/mg-ddm/ddmd
argv[1]: --with-stats
argv[2]: --admin-port
argv[3]: 8000
argv[4]: --admin-addr
argv[5]: ::1
argv[6]: --kind
argv[7]: server
argv[8]: -a
argv[9]: cxgbe0/ll
argv[10]: -a
argv[11]: cxgbe1/ll
argv[12]: --dns-servers
argv[13]: unknown
It's worth confirming the expected behavior here. Either way, the note here is incomplete.
There was a problem hiding this comment.
Looks like it's configured at https://github.com/oxidecomputer/maghemite/blob/396bb3c/smf/ddm/manifest.xml#L25 -- added a link to that.
|
|
||
| [[localhost_only_edges]] | ||
| server = "dpd" | ||
| client = "gateway-client" |
There was a problem hiding this comment.
Looking at the referenced source: it looks like someone could override this by providing a CLI argument to dpd. What do you think about seeing if we can put that CLI option behind a dev-only feature flag so that shipping code can't do this? (I'd of course talk to Levon about this before proceeding.)
(I'd do this as a follow up and just mention it in the note for now.)
Edit: maybe a simpler solution would be to remove these from the SMF start script, if that's possible? I noticed that the mgd -> dendrite dependency looks similar (can be overridden by a CLI argument), but the method script doesn't ever set the --mgs-addr option, which makes it pretty unlikely this would get broken in production.
dev-tools/ls-apis/api-manifest.toml
Outdated
| [[localhost_only_edges]] | ||
| server = "lldpd" | ||
| client = "dpd-client" | ||
| note = "verified: lldpd defaults to localhost for dpd (lldp dendrite.rs:194)" |
There was a problem hiding this comment.
Similarly, it looks like you can configure lldpd to point to a specific dpd with a CLI option. Interestingly, that one is behind a feature flag. I wonder which one we ship with?
dev-tools/ls-apis/api-manifest.toml
Outdated
| # NOTE: The following sled-agent edges are BUGS - they use underlay IPs, not | ||
| # localhost. They are kept here to avoid breaking the build, but need to be | ||
| # fixed, either through client-side versioning or by some other means. |
There was a problem hiding this comment.
This comment does not seem accurate. At least, the second item down (L646-L649) is not labeled "BUG" like the others and does not seem like a bug.
dev-tools/ls-apis/api-manifest.toml
Outdated
| [[localhost_only_edges]] | ||
| server = "omicron-sled-agent" | ||
| client = "gateway-client" | ||
| note = "BUG: uses underlay IP, not localhost (early_networking.rs:376)" |
There was a problem hiding this comment.
There appear to be two uses of gateway_client in sled-agent/src/bootstrap/early_networking.rs. The one Claude flagged actually seems fine, but there's a different one that it missed that looks problematic.
The one it flagged here is in init_switch_config(), where the comment and function imply that we are always talking about the local switch zone.
The one it missed is in lookup_switch_zone_underlay_addrs_one_attempt(), where the address comes from DNS, and so that could indeed wind up talking to a different sled's switch zone. That one does seem problematic.
db00e1a to
c74b429
Compare
|
Offline: I discussed with @sunshowers that, although I nudged us down this path, the complexity added here (both in terms of what the reader/author of the manifest has to deal with and the implementation) seems way more than I would have expected. I think it goes something like this:
There are three paths here from That's okay, but how would we relax it? This PR essentially says: with this new kind of rule, we divide filter rules into two groups: those using
One alternative I considered is to resolve this problem a different way: say that paths can match more than one rule only if all but one are |
…at the top level of the manifest
|
Okay, I've implemented the proposed code changes, along with pulling in some improvements @sunshowers made before I wound back. From my perspective what's left here is reviewing the specific metadata. I'll take a look at that tomorrow. |
Fixes #9706.
This is a prototype.